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Abstract. Methods for choosing a fixed set of knot locations in additive sphne models 
are fairly well established in the statistical literature. While most of these methods are 
in principle directly extendable to non-additive surface models, they are less likely to be 
successful in that setting because of the curse of dimensionality, especially when there are 
more than a couple of covariates. We propose a regression model for a multivariate Gaussian 
response that combines both additive splines and interactive splines, and a highly efhcient 
MCMC algorithm that updates all the knot locations jointly. We use shrinkage priors to 
avoid overfitting with different estimated shrinkage factors for the additive and surface part of 
the model, and also different shrinkage parameters for the different response variables. This 
makes it possible for the model to adapt to varying degrees of nonlinearity in different parts 
of the data in a parsimonious way. Simulated data and an application to firm leverage data 
show that the approach is computationally efficient, and that allowing for freely estimated 
knot locations can offer a substantial improvement in out-of-sample predictive performance. 
Keywords: Bayesian inference, Markov chain Monte Carlo, Surface regression. Splines, 
Free knots. 



1. Introduction 



Flexible model s of the regres s ion fu nction E{y\x) has been an active research field for 
decades, see e.g. iRuppert et al.l ( 120031 ) for a recent textbook introduction and f urther ref- 
erences. Intensive research was initially devoted to kernel regression methods (iNadaraya 
19641 lwatso3l964i . loasser fc Miilleij Il979l ). and later followed by a large literature on spline 



regression modeling. A spline is a linear regression on a set of nonlinear basis functions of 
the original regressors. Each basis function is defined from a knot in regressor space and the 
knots determine the points of flexibility of the fitted regression function. This gives rise to a 
locally adaptable model with continuity at the knots. 



Li (corresponding author): Department of Statistics, Stockholm University, SE-106 91 Stockholm, Sweden. 
E-mail: f eng. TiOstat . su. se[ Villani: Division of Statistics, Department of Computer and Information 
Science, Linkoping University, SE-581 83 Linkoping, Sweden. E-mail: Imattias . villajiiQliu . sel 



EFFICIENT BAYESIAN MULTIVARIATE SURFACE REGRESSION 



2 



The most widely used models assume additivity in the regressors, i .e. E(j/ | a:i Xq) = 

y^i'j-i fj(xj), where fj{xj) is a spline function for the jth regressor (IHastie fc Tibshirani 



19901 ). Assuming additivity is clearly a very convenient simplification, but it is also some- 



what unnatural to make such a strong assumption in an otherwise very flexible model. This 
has motivated research on surface models with interactions between regressors. One line 
of research extends the additive models by including higher-order interactions of the spline 



Hastie et al. 


( 


— „ 
2009) 


Friedman 


Jl991 


)is 



tering the model using a greedy algorithm. Regres sion trees (IBreiman et al.lll984l) is another 
popular class of models, with the BART model in IChipman et al.l (120101 ) as its most promi- 
nent Bayesian member. Our paper follows a r ecent strand of li terature that models surfaces 
using radial basis functions splines, see e.g. iBuhmannI ( l2003l ). A radial basis function is 
defined in R'' and has a value that depends only on the distance from a covariate vector (a:;) 
to its g-dimensional knot (^), e.g. the cubic radial basis ||a; — ^|| , where 
^ = (^1, ...,C,q)' and ||-|| is the Euclidean norm. The model is again linear in the basis expanded 
space. 

The basic challenge in spline regression is the choice of knot locations. This problem is 
clearly much harder for a general surface than it is for additive models since any manage- 
able set of g-dimensional knots are necessarily sparse in R'' when q is moderate or large, a 
manifestation of the curse of dimensionality. The state-of-the-art inferential procedures place 
the knots at the centroids from a clustering of the regressor observations. The selected knot 
locations are kept fixed throughout the analysis. To prevent overfitting, Bayesian variable 
selection methods are used to automatically remove or downweight the influ ence of the knots 
using Markov chain Monte Carlo (MCM C) methods fjSmith fc Kohnlll996l ). The reversible 
jump MCMC (RJMCMC) in for example 



Denison et al. 



({2002 ) treats the number of knots as 



unknown subject to an upper bound, but the location of the knots are still fixed throughout 
the analysis. 

Using a fixed set of knot locations is impractical when estimating a surface with more than 
a few regressors. An algorithm that can move the knots rapidly over the regressor space is ex- 
pected to be a clear improvement. All previous attempts have focused on efficient selection of 



fixed knots, and have pai d little attention t o mov i ng the knots. 
RJMCMC approaches in Ipimatteo et aL (2001), Denison et al. 



' he oth e rwise very elaborate 
f |l998l 



Gulam Razul et al 
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([20031) and iHolmes &: Mallickl ( 120031 ) all include a very simple MCMC update where a single 
knot is re-located using a Metropolis random walk step with a proposal variance that is the 
same for all knots. There are typically strong dependencies between the knots, and local one- 
knot-at-a-time moves will lead to slow convergence of the algorithm and inability to escape 
from local modes, see Section 15.41 for some evidence. This is especially true in the surface 
case with more than a couple of regressors. 

The main contribution in this paper is a highly efficient MCMC algorithm for the Gaussian 
multivariate surface regression where the locations of all knots are updated jointly. Rapid 
mixing of the knot locations is obtained from the following two features of our algorithm. 
First, the knots are simulated from a marginal posterior where the high- dimensional regres- 
sion coefficients have been integrated out analytically. Second, the knots' proposal distri- 
bution is tailored to the posterior distribution using the posterior gradient, which we derive 
in compact analytical form and evaluate efficiently by a careful use of sparsity. We use a 
shrinkage prior on the regression coefficients to prevent overfitting, where the shrinkage hy- 
perparameters are treated as unknowns and are estimated in a separate updating step. Also 
this step is tailored to the posterior using the gradient in analytical form. 

Even a highly efficient MCMC algorithm is likely to have problems exploring the joint 
posterior of many surface knots in a high-dimensional covariate space. To deal with this, our 
model is decomposed into three parts: i) the original covariates entering in linear form, ii) 
additive spline basis functions and iii) radial basis functions for capturing the remaining part 
of the surface and interactions. The idea is to let the additive part of the model capture the 
bulk of the nonlinearities so that the radial basis functions can focus exclusively on modeling 
the interactions. This way we can keep the number of knots in the interaction part of the 
model to a minimum, which is beneficial for MCMC convergence. We use separate shrinkage 
priors for the three parts of the model. Moreover, we also allow for separate shrinkage 
parameters in each response equation. This gives us an extremely flexible yet potentially 
parsimonious model where we can shrink out e.g. the surface part of the model in a subset 
of the response equations. 

Our MCMC scheme is designed for a flxed number of knots, and we select the number of 
knots by Bayesian cross-validation of the log predictive score using parallel computing, see 
Section 13.31 This has the disadvantage of not accounting for the uncertainty regarding the 
number of knots as is done in RJMCMC schemes, but the beneflts are substantially more 
robustness to variations in the prior and improved MCMC efficiency. 
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We illustrate our algorithm on simulated and real data, and compare the predictive perfor- 
mance of the models using Bayesian cross-validation techniques. We find that the free knots 
model constantly outperforms the model with fixed knots. Additionally, we find it is easier 
to obtain better fitting result by combining additive knots and surface knots in the model. 



2. BAYESIAN MULTIVARIATE SURFACE REGRESSION 

2.1. The model. Our proposed model is a Gaussian multivariate regression with three sets 
of covariates: 

Y = XoBo + Xa{ia)Ba + + E, (1) 

where "K(n x p) contains n observations on p response variables, and the rows of E are 
error vectors assumed to be iid Np(0, S). The matrix Xo{n x g^) contains the original 
regressors (first column is a vector of ones for the intercept) and B^, holds the corresponding 
regression coefficients. The Qa columns of the matrix Xa{$,a) are additive splines functions 
of the covariates in Xq. Our notation makes it clear that Xa depends on the knots ^q. Note 
that the knots in the additive part of the model are scalars, and that our model allows for 
unequal number of knots in the different covariates. Finally, Xs{$,s) contains the surface, or 
interaction, part of the model. The knots in are go-dimensional vectors. Note how this 
decomposition makes it possible for the additive part of the model to capture the main part 
of the nonlinearities so that the number of knots in Xg is kept to a minimum. We will refer 
to the three different parts of the model as the linear component, the additive component and 
the surface component, respectively. We will refer to and as the additive and surface 
knots, respectively. Likewise, Ba and Bg are the additive and surface coefficients. 

There are a large number of different spline bases that one can use for th e additive part 
of the model. The menu of choices for the surface basis is more limited, see Denison et al. 



(120021 ) for a survey of the most commonly used bases. We will use thin-plate splines for 
illustration, but our approach can be used with any basis with trivial changes, see Section [3] 
and Appendix |X] for computational details. The thin-plate spline basis in the surface case is 
of the form 

'^sji.^sj) \\'^o ^sjr'll 111 II "^o ^sjll) 3 ^1 (2) 

where is one of the original data points and is the jth g^- dimensional surface knot. 
The univariate thin-plate basis used in the additive part is a special case of the multivariate 
thin-plate in ([2]) where both the data point and the knot are one-dimensional. 
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For notational convenience, we sometimes write model in compact form 

Y = XB + E, 

where X = [Xg, Xa, Xg] is the nx q design matrix {q = qo + C[a + Qs) and B = [B'^, B[^, B2' ■ 
Define also bj = vec-Bj as the vectorization of the coefficients matrix Bi, and b = [b'^, b^, 6^]'. 

For a given set of fixed knot locations, the model in ([T]) is linear in the regression coefficients 
B. As explained in the Introduction, the great challenge with spline models is the choice of 
knot locations. This is especially true in the surface case where the curse of dimensionality 
makes it really hard to distribute the multi-dimensional knots in M''° in an effective way. To 
get a fair coverage of knots in the covariate space, a recommended approach is to place the 
knots at the cluster centers from some clustering algo r ithm^ e.g. fc-means cluste r ing o r using 
a mixture of multivariate normals, see 



Smith fc KohnI (119961 ) and 



Denison et al. 



(119981 ). This 

typically leads to many redundant knots (since the response variables are not used to aid 
the clustering) which is a source of overfitting. One solu tion is to remove (downweight) the 



knots by Bayesian variab l e selec tion ( Smith fc Kohn 

h00l\ \ and 



Dimatteo et al 



Denison et al 



19961) . possibly in a RJMCMC approach. 
Nevertheless, using a set of pre- 



(120021) 



see e.g. 

determined knots is unlikely to work well in the surface case with more than a handful of 
regressors. 

We will treat the knot locations in and as unknown parameters to be estimated. 
This is in principle straightforward from a Bayesian point of view, but great care is needed 
in the actual implementation of the posterior computations. We propose an efficient MCMC 
scheme for sampling from the joint posterior of the all knot locations and the regression 
coefficients, see Section I3] for details. The model is clearly highly (over)parametrized and in 
need of some regularization of the parameters. The two main regularization techniques in 
Bayesian analysis are shrinkage priors and variable (knot) selection priors. Variable selection 
can in principle be incorporated in the analysis, but would be computationally demanding 
since the number of gradient evaluations needed in our MCMC algorithm would increase 
dramatically. This is important since evaluating the gradient with respect to the knots is 
time-consuming as the knot locations enter the likelihood in a very complicated nonlinear way; 
see Section l3l2l for details. Moreover, part of the attraction of variable selection is that they 
also provide interpretable measures of variable importance; this is much less interesting here 
since the covariates correspond to knot locations, which are not interesting in themselves. We 
have therefore chosen to achieving parsimony with shrinkage priors that pull the regression 
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coefficients towards zero (or any other reference point if so desired), see Section 12.21 for 
details. We allow for separate shrinkage parameters for the linear, additive and surface parts 
of the model, and separate shrinkage parameters for the p responses within each of the three 
model parts. The shrinkage parameters are treated as unknowns and estimated, so that, for 
example, the surface part can be shrunk towards zero if this agrees with the data. Allowing 
the knots to move freely in covariate space introduces a knot switching problem similar to 
the well-known label switching problem in mixture models. The likelihood is invariant to a 
switch of two knot locations and their regression coefficients. This lack of identification is 
not important if our aim is to model th e regression su rface E{y\x), without regard to the 
posterior of the individual knot locations ( lGewekdl2007l ). Also, the MCMC draws of the knot 
locations can also be used to construct heat maps in covariate space to represent the density 
of knots in a certain regions, see Section [5l Such heat maps are clearly also immune to the 
knot switching problem. 



2.2. The prior. We now introduce an easily specified shrinkage prior for the three sets of 
regression coefficients Bo, Ba and Bs and the covariance matrix S, conditional on the knots. 
The prior for b and S are set as 

IW (rioS'o, no) , 



vecSj|S, Aj 
S 



, i e {o,a,s}, 



with prior independence between the Bi. The prior mean of vecBi is Hi, which we set to zero 
in our shrinkage prior. Aj = diag(Aj) = diag(Aj^i, Aj^p), Pi is a positive definite symmetric 
matrix. IW( ■ ) denotes the inverse Wishart distribution, with location matrix So and degrees 
of freedom no- Pi is typic ally either the i dentity matrix or Pi = X-Xi. The latter choice has 
been termed a g'-prior by IZellnerl (jl986[ ) and has the advantage of automatically adjusting 
for the different scales of the covariates. Setting Aj = n makes the information content of 
the prior equivalent to a single data point and is usually called the unit information prior. 
The choice of Pi = Iq. can prevent the design matrix from falling into singularity problem 
when some of the basis functions are h ighly correlated, whi ch can easily happen with many 
spline knots. See also the discussion in iDenison et al. Our default choice is therefore 



X'Xr,, Pa. 



Ig^ and Ps 



Other shrinkage priors on the regression coefficients 



can be used in our app roach, for example the Laplace distribution leading to the popular 
Lasso (ITibshiranil 119961 ) . but they will typically not allow us to integrate out the regression 
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coefficents analytically, see Section ISTTl The optimal choice of shrinkage prior depends on the 
unknown data generating model (a normal prior is better when all coefficients have roughly 
the same magnitude; Lasso is better when many coefficients are close to zero, but some are 
really large etc). 

We also estimate the shrinkage parameters, Ao, Xa and As via a Bayesian approach. Note 
that our prior constructions for B allow for separate shrinkage of the linear, additive and 
surface components. This gives us automatic regularizat ion/ shrinkage of the regression coef- 
ficients and helps to avoid problems with overfitting. Our MCMC scheme in Section [3] allows 
for a user-specified prior on Ajj, for i G {o,a,s} and j = l,2,...,p of essentially any func- 
tional form. However the default prior of Xij in this paper follows a log normal distribution 
with mean of n/2 and standard deviation of n/2 in order to ensure that both tight and flat 
shrinkages are attainable within one standard deviation in the prior. For computational con- 
venience, we use a log link for Ajj and make inference on log(Ajj). As a result the preceding 
prior on Xij yields a normal prior for log(Ajj) with mean [log(n) — 3/2 • log(2)] and variance 
log(2). 

We use the same number of additive knots for each covariate in the simulations and the 
application in Section H] and |5l but it should be clear that our approach also permits unequal 
number of knots in the different covariates. There is no particular requirements for the prior 
on the knots, but a vague prior should permit the knots to move freely in covariate space. 
Our default prior assumes independent knot locations following a normal distribution. The 
mean of the knots comes from the centers of a k-means clustering of the covariates. In the 
additive case, the prior variance of all the knots in the kth covariate is c^(a'a)~^, where a is 
the kth column of Xo- Similarly, the prior covariance matrix of a surface knot is c^{X'^Xo)~^ . 
We use = as the default setting. 

The hyperparameter Sq in the IW prior for S is set equal to the estimated error covariance 
matrix from the ffited linear model Y = XoBq. A small degrees of freedom (no) gives diffuse 
prior on S and uq = 10 is set as the default. 

For notational convenience and further computational implementation, we write the prior 
for the regression coefficients in condensed form as A ~ N (/x*, Sf,) where A = (A'^, A^, A(,)', 
fi* = in'^, n'^, n'J , Sb = (Ai/25]^^i/2) ^ p i ^ ^ diag(A), is a three-block diagonal 
matrix with S on each block, P = dia^j P„, Pn, P.^) is a block diagonal matrix and A ^ C 
denotes the Khatri-Rao product ( iKhatri fc Rad 119681 ) which is Kronecker product of the 
corresponding blocks of matrices A and C. It will also be convenient to define /3 = vecB. 
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Note that b and (3 contain the same elements with two different stacking orders. As a result, 
/3|S, A ~ N {fi, S^) where fi and essentially have the same entries as /x* and S5 have, 
respectively (Section lA.3p . 



3. The posterior inference 
3.1. The posterior. The posterior distribution can be decomposed as 

p{B, S, I, X\Y, X) = p{B\t A, S, Y, X)p{t A, S|r, X) 

where 

vecB|^,A,S,l-,X~N(/3, S^), 



= [S-i ® X'X + , /3 = vecS = S^[vec(X'rS-i) + 2"^] flZellnei]ll97lh . and 

P (^, A, S|r, X) = C X p{^, A) X |S^rl/2|5.|-(n+„o+p+l)/2|5.^|-l/2 

X exp trS-^ (noSo + nS) + - m)' 5]^' (-^ ~ ^) } 

where 5 = c = 2-("o+'^+«)p/27r-P("+'?)/2^;l(r^o/2)|r^o5o|"o/^ 1^(0) = 

^p(p-i)/4 Jl^^^ r -|_ (^1 _ j)/2] is the multivariate gamma function. It is important to note 
that it is in general not possible to integrate out I] analytically in our model. This is a 
consequence of using different shrinkage factors for the different responses and on the original, 
additive and surface parts of the model (the prior covariance matrix of B does not have a 
Kronecker structure). Only in the special case with a univariate response {p = 1) can we 
integrate out S analytically, since S is then a scalar. To obtain a uniform treatment of the 
models and their gradients, we have chosen to not integrate out S even for the case p = 1. 
The next subsection proposes an MCMC algorithm for sampling from the joint posterior 
distribution of all parameters. 

3.2. The MCMC algorithm. Our approach is to sample from p (^, A, X) using a 
three-block Gibbs sampling algorithm with Metropolis-Hastings (MH) updating steps. Draws 
from p{B\$,, Xj'EjY, X) can subsequently be obtained by direct simulation. The updating 
steps of the Gibbs sampling algorithm are: 

(1) Simulate S from p(S||, A, Y, X). 

(2) Simulate ^ from p(^|A, S, 1^, X). 

(3) Simulate A from p{\\^, S, Y, X). 
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In the special case when p 



S|^, A,r,X ~ IW no5, 



n 



^ iG{o,a,s} 



-1/2 



no + n 



(4) 

where Mi and Bi are the prior and posterior mean of Bi, respectively. Actually, when 
p = 1, S is a scalar and the IW density reduces to a scaled distribution. When p > 1, 
pC^I^, \,Y , X) is no longer IW, but the distribution in (jlj) is an excellent approximation 
and can be used as a very efficient MH proposal density. 

The conditional posterior distributions for ^ and A in Steps (2) and (3) above are highly 
non-standard and we update these parameters using Metropolis-Hastings steps with a tai- 
lored proposal, which we now describe for a general parameter vector 6 with posterior 
p{0\Y), which could be a conditional posterior in a Metrop olis- within-Gibbs a lgorithm (e.g. 
p{$,\X, S, Y,X)). This method w as or iginally proposed by iGamermanI ( 1l997l ) and later ex- 



tended by iNott fc Leontd (120041 ) and 



Villani et al. 



(120121 ). All of these three articles are 



confined to a generalized linear model (GLM) or GLM-like context where the parameters 
enter the likelihood function through a scalar-valued link function. A contribution of our 
paper is to show that the algorithm can be extended to models without such a nice structure 
and that it retains its efficiency even when the parameters are high-dimensional and enter 
the model in a highly nonlinear way. The way the knot locations and the shrinkage parame- 
ters are buried deep in the marginal posterior (see Equation 3.1 above) makes the necessary 
gradients (see below) much more involved and numerically challenging (see Appendix Rl) . 

At any given MCMC iteration we use Newton's method to iterate R steps from the current 
point 6c in the MCMC sampling towards the mode of p{6\Y), to obtain 6 and the Hessian 
at 6. Note that may not be the mode but is typically close to it already after a few Newton 
iterations since the previously accepted is used as the initial value; setting = 1, 2 or 3 
is therefore usually sufficient. This makes the algorithm very fast. Having obtained good 
approximations of the posterior mode and covariance matrix from the Newton iterations, the 
proposal 6p is now drawn from the multivariate t-distribution with u > 2 degrees of freedom: 

^'^ fd'\np{e\Y)\-' 



6p\0c 



dddS' 



0=0 
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where the second argument of the density is the covariance matrix and 6 is the terminal 
point of the R Newton steps. The Metropohs-Hastings acceptance probabihty is 

The proposal density at the current point g{6c\0p) is a multivariate t-density with mode 
6 and covariance matrix equal to the negative inverse Hessian evaluated at 6, where 6 is 
the point obtained by iterating R steps with the Newton algorithm, this time starting from 
Op. The need to iterate backwards from 6p is clearly important to fulfill the reversibility of 
the Metropolis-Hastings algorithm. When the number of parameters in 6 is large one can 
successively apply the algorithm to smaller blocks of parameters in 6. 

The tailored proposal distribution turns out to be hugely beneficial for MCMC efficiency, 
see Section |57^ for some evidence, but a naive implementation can easily make the gradient 
and Hessian evaluations an insurmountable bottleneck in the computations, and a source of 
numerical instability. We have found the outer product of gradients approximation of the 
Hessian to work very well, so all we need to implement efficiently are the gradient vector 
of A, S, 1^, X) and p{X\^,'E,Y , X). Appendix lAl gives compact analytical expression 
for these two gradient vectors, and shows how to exploit sparsity to obtain fast and stable 
gradient evaluations. Our gradient evaluations can easily be orders of magnitudes faster than 
state-of-the-art numerical derivatives, and substantially more stable numerically. For exam- 
ple, already in a relatively small- dimensional model in Section [5] with only four covariates, 20 
surface knots and 4 additive knots, the analytical gradient for the knot parameters are more 
than 40 times faster compared to a numerical gradient with tolerance of 10~^. Since the gra- 
dient evaluations accounts for 70-90% of total computing time, this is clearly an important 
advantage. 



3.3. Model comparison. The number of knots is determined via the D-fold out-of-sample 
log predictive density score (LPDS), defined as 

where is an (n^ x p)-dimensional matrix containing the observations in the dth testing 
sample and YL^ denotes the training observations used for estimation. If we assume that the 
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observations are independent conditional on 6, then 



where is the index set for the observations in Y^, and the LPDS is easily computed by 
averaging Y\^^^^p{yi\0,x.-^ over the posterior draws from p{6\Y_f}). This requires sampling 
from each of the D posteriors p{0\Y^d) for c? = 1, ...,D, but these MCMC runs can all be 
run in isolation from each other and are therefore ideal for straightforward parallel com- 
puting on widely available multi-core processors. The main advantage for choosing LPDS 
instead of the marginal likelihood is that the LPD S is no t nea rly as sensitive to the choice 
of prior as the marginal likelihood, see e.g. iKasj ( 119931 ) and iRichardson fc GreenI (119971 ) 
for a general discussion. The marginal likelihood can also lead to poor predictive inference 
whe n the true data gene r ating process is not included in the class of compared models, see 



e.g. iGeweke &: Amisand ( 120111 ) for an illuminating perspective. The main disadvantage of 
using the LPDS for selecting the number of knots is that, unlike the marginal likelihood and 
RJMCMC, there is no rigorous way of including the uncertainty regarding the number of 
knots in the final inferences. The dataset is systematically partitioned into five folds in our 
firm leverage application in Section [5l 



4. Simulations 

As discussed in the Introduction, the most commonly used approach for spline regression 
modeling is to use a large number of fixed knots and to use shrinkage priors or Bayesian vari- 
able selection to avoid overfitting ( iDenison et al.l 120021 ). We compare the performance of the 
traditional fixed knots approach to our approach with freely estimated knot locations using 
simulated data with different number of covariates and for varying degrees of nonlinearity in 
the true surface. We use shrinkage priors with estimated shrinkage both for the fixed and 
free knot models, but no variable selection. Models with univariate and multivariate response 
variables are both investigated. 

4.1. Simulation setup. We consider data generating processes (DGP) with both univariate 
{p = 1) and bivariate {p = 2) responses, and datasets with Qo = 10 regressors and two sample 
sizes, n = 200 and n = 1000. We first generate the covariate matrix from a mixture of 
multivariate normals with five components. The weight for the rth mixture component is 
Ur/ Ya=i 'where Ui, are independent U(0, 1) variables. The mean of each component 
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is a draw from U(— 1, 1) and the components' variances are all 0.1. We randomly select five 
observations without replacement from as the true surface knots and then create the 
basis expanded design matrix X using the thin-plate radial basis surface spline, see Section 
12.11 The coefficients matrix B is generated by repeating the sequence { — 1, 1}. The error 
term E is from multivariate normal distribution with mean zero, variance 0.1 and covariance 
0.1. These s ettings guarantee a reasonable signal-to-noise ratio. 



Following IWood et al.l ( 120021 ). we measure the degrees of nonlinearity (DNL) in the DGP 
by the distance between the true surface /(■) and the plane g{-) fitted by ordinary least 
squares without any knots in the model, i.e. 



DNL = ^n-i - g{x,)Y. (5) 

A larger DNL indicates a DGP with stronger nonlinearity. 

We generate 100 datasets and for each dataset we fit the fixed knots model with 5, 10, 
15, 20, 25 and 50 surface knots, and also the free knots model with 5, 10, and 15 surface 
knots. All fitted models have only linear and surface components. The knot locations are 
determined by /c-means clustering. We compare the models with respect to the mean squared 
loss 

where /(■) is the true surface and /(■) is the posterior mean surface of a given model with 
qs surface knots. The Loss in ([6]) is evaluated over a new sample of covariate vectors, and 
it therefore measures out-of-sample performance of the posterior mean surface. We will here 
set ra* = n. Note that the shrinkages and the covariance matrix of the error terms are also 
estimated in both the fixed and free knots models. 

4.2. Results. We present the results for p = 2 and n = 200. The results for p = 1 and 
n G {200, 1000}, and p = 2 and n = 1000 are qualitatively similar and are available upon 
request. The Supporting Information documents the results for p = 2 and n = 1000 for a 
few different model configurations. Figure [1] displays boxplots for the log ratio of the mean 
squared loss in (Q. The columns of the figure represents varying degrees of nonlinearity 
in the generated datasets according to the estimated DNL measure in equation Each 
boxplot shows the relative performance of a fixed knots model with a certain number of knots 
compared to the free knots model with 5 (top row), 10 (middle row) and 15 (bottom row) 
surface knots, respectively. The short summary of Figure [1] is that the free knots model 



EFFICIENT BAYESIAN MULTIVARIATE SURFACE REGRESSION 



13 



outperforms the fixed knots model in the large majority of the datasets. This is particularly 
true when the data are strongly nonlinear. The performance of the fixed knots model improves 
somewhat when we add more knots, but the improvement is not dramatic. Having more fixed 
knots clearly improves the chances of having knots close to the true ones, but more knots 
also increase the risk of overfitting. 

The aggregate results in Figure [1] do not clearly show how strikingly different the fixed and 
free knots models can perform on a given dataset. We will now show that models with free 
rather than fixed knots are much more robust across different datasets. Figure |2] displays 
the Euclidean distance of the multivariate out-of-sample predictive residuals y/i'i for a few 
selected datasets as a function of the distance between the covariate vector and the sample 
mean of the covariates. The normed residuals depicted in the leftmost column are from 
datasets chosen with respect to the ranking of the out-of-sample performance of the fixed 
knots model. For example, the upper left subplot shows the predictive residuals of both 
the model with 15 fixed knots (vertical bars above the zero line) and the model with 5 free 
knots (vertical bars below the zero line) on one of the datasets where the fixed knot models 
outperform the free knots model by largest margin (3rd best Loss in favor of fixed knots 
model). It is seen from this subplot that even in this very favorably situation for the fixed 
knots model, the free knots model is not generating much larger predictive residuals. Moving 
down to the last row in the left hand column of Figure O we see the performance of the two 
models when the fixed knots model performs very poorly (3rd worse Loss with respect to 
the fixed knots model). On this particular dataset, the free knots model does well while the 
fixed knots model is a complete disaster (note the different scales on the vertical axes of the 
subplots). The column to the right in Figure [2] shows the same analysis, but this time the 
datasets are chosen with respect to the ranking of the Loss of the free knots model. Overall, 
Figure [2] clearly illustrates the superior robustness of models with free knots: the free knots 
model never does much worse than the fixed knots model, but using fixed rather than free 
knots can lead to a dramatically inferior predictive performance on individual datasets. 

4.3. Computing time. The program is written in native R code and all the simulations 
were performed on a Linux desktop with 2.8 GHz CPU and 4 GB RAM on single instance 
(without parallel computing). Table [1] shows the computing time in minutes for a single 
dataset. In general the computing time increases as the size of the design matrix increases, 
but it increases only marginally as we go from p = 1 to p = 2. 
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Figure 1 . Boxplot of the log loss ratio comparing the performance of the fixed 
knots model with the free knots model for the DGP with p = 2 and n = 200. 
The three columns of the figure correspond to different degrees of nonlinearity 
of the realized datasets, as measured by estimated DNL in (j5]). 

5. Application to firm capital structure data 



5.1. The data. The classic paper by lRajan fc Zingalesl ( 119951 ) analyze firm leverage (leverage 
= total debt/ (total debt + hook value of equity)) as a function of its fixed assets (tang = tan- 
gible assets/hook value of total assets), its market-to-book ratio (market2book = (book value 
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Figure 2. Plotting the norm of the predictive multivariate residuals as a 
function of the distance between the covariate vector and its sample mean. 
The results are for the DGP with p = 2 and n = 200. The lines in each 
subplot are the normed residuals from the model with 15 fixed surface knots 
(vertical bars above the zero line), and the model with 5 free knots (vertical bars 
below the zero line). The column to the left shows the results for three datasets 
chosen when performance is ranked according to the fixed knots model, and the 
right column displays the results for three datasets chosen when performance 
is ranked according to the free knots model. 
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Table 1. Elapsed computing time (in minutes) for 5,000 iterations with a 
single dataset of 10 covariates. 
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24 
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of total assets - book value of equity + market value of equity) /hook value of total assets), 
logarithm of sales (LogSale) and profit (Profit = earnings before interest, taxes, deprecia- 
tion, and amortization/book value of total assets). Strong nonlinearities seem to be a quite 



linear/nonparametric models, see e.^ 
We use a similar data to the one in 



Bastos &: Ramal 



Raj an fc Zingalej ( 



rxiciet 
(2010 


3 nave 
), and 


suggesxea usii 
Villani et al. 


ig non 
(2012) 



19951 ) which covers 4, 405 American 



non-financial firms with positiv e sales in 199 2 and complete data records and analyze the 

(120121 ) analyze the same data with a smooth 



Villani et al. 



leverage in terms of total debt, 
mixture of Beta regressions. 

Figure [3] plots the response variable leverage in both original scale and logit scale (ln[?//(l — 
y)]) against each of the four covariates. The relationships between the leverage and the 
covariates are clearly highly nonlinear even when the logit transformation is used. There are 
also outliers which can be seen from the subplots with respect to covariates Market2Book and 
Profit. 



5.2. Models with only surface or additive components. We first fit models that either 
have only a surface component or only an additive component (both types of models also 
have a linear component). Note that the shrinkage parameters are also estimated in all cases. 
All four covariates are used in the estimation procedure and we use the logit transformation 
of the leverage, and standardize each covariate to have zero mean and unit variance. 

Figure S] depicts the LPDS for the surface component model and the additive component 
model for both the case of fixed and free knots. The LPDS generally improves as the number 
of knots increases for both the fixed and free knots models, but seems to eventually level off at 
large number of knots. The free knots model always outperforms the fixed knots model when 
only a surface component is used (left subplot). For example, the model with 12 free surface 
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Figure 3. Scatter plots of the firm leverage data with leverage ("5^) on both 
original scale (top subplots) and logit transformed scale (bottom subplots) 
against each of the four covariates. 



knots is roughly 32 LPDS units better than the fixed knots model with the same number 
of knots. This is a quite impressive improvement in out-of-sample performance considering 
that the fixed knot locations are chosen with state-of-the-art clustering methods for knot 
selection. The ability to move the knots clearly also helps to keep the number of knots to a 
minimum; it takes for example more than 30 fixed surface knots to obtain the same LPDS 
as a model with 12 free surface knots. 

Turning to the strictly additive models in right subplot of Figure H] we see that the additive 
models are in general inferior to the models with only surface knots, and that the differences 
in LPDS between the fixed and free knots approaches are much smaller here, at least for eight 
knots or more. The improvement in LPDS levels off at roughly 16 knots. It is important 
to note that the horizontal axis in Figure H] displays the number of additive knots in each 
covariate, and the fact that we do not overfit bear testimony to the effectiveness of the 
shrinkage priors. 
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Figure 4. LPDS for the firm leverage data with surface component model 
(left) and additive component model (right). Note that the number of knots 
in additive model is the number of spline basis functions on each covariate. 



5.3. Models with both additive and surface components. We now consider models 
with both additive and surface components. It is worth mentioning that we draw from the 
joint posterior distribution of the surface and additive knots, see Section |X]for MCMC details. 

Figure [5] shows that there are generally improvements from using both surface knots and 
additive knots in the same model. For example, the model with 4 free surface knots has an 
LPDS of —1, 284. Adding two free additive knots increases the LPDS to —1, 270 and adding 
another two additive knots gives a further increase of 14 LPDS units. Figure [5] also shows 
strong gains from estimating the knots' locations, but the improvement in LPDS from free 
knots tends to be less dramatic when more additive knots are used to complement the surface 
knots. There is little or no improvement in LPDS as the number of surface knots approaches 
60. The results in Figure [5] reinforces the evidence in Figure H] that the shrinkage prior is 
very effective in mitigating potential problems with overfitting. 

To simplify the graphical presentation of the results, we choose to illustrate the posterior 
inference of the knot locations in a model with only the two covariates Market2Book and 
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Figure 5. LPDS for the firm leverage data for the free and fixed knots models 
with varying number of surface and additive knots. 
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Profit. We use 20 surface knots and 4 additive knots in eacli covariate. Tlie mean acceptance 
probabilities for the knot locations and the shrinkage parameters in Metropolis-Hastings 
algorithm are 0.73 and 0.64, respectively, which are exceptionally large considering that all 
2x 20 + 2x4 = 48 knot location parameters are proposed jointly, as are all the shrinkage 
parameters. The acceptance probability in the updating step for S is 1 since we are proposing 
directly from the exact conditional posterior when p = 1 . Because of the knot switching 
problem (see Section [3]), it does not make much sense to display the posterior distribution 
of the knot locations directly. We instead choose to partition the covariate space into small 
rectangular regions, count the frequency of knots in each region over the MCMC iterations, 
and use heat maps to visualize the density of knots in different regions of covariate space. 
Figure [6] displays this knot density heat map. As expected, the estimated knot locations 
are mostly concentrated in the data dense regions, particularly in regions where the relation 
between the covariates and response in the data is most nonlinear, which is seen by comparing 
Figure [6] and Figure |3l 

Finally, we present the posterior surface for the firm leverage data in Figure [71 To enhance 
the visual representation, the graphs zoom in on the region with the majority of the data 
observations. Figure [7]plots the mean (left) and the standard deviation (right) of the posterior 
surface. The latter object is for brevity sometimes referred to as the posterior standard 
deviation surface. Figure [7] (right) also displays the covariate observations to give a sense 
of where the data observations are located. The Supporting Information to this article 
investigates the robustness of the posterior results to variations in both the prior mean and 
variance of the knot locations. The posterior heat map of the knot locations are affected 
by the fairly dramatic variations in the prior mean of the knots, and to a lesser extent by 
changes in the prior variance of the knot locations, but the posterior mean and standard 
deviation surfaces are robust to variations in the prior on the knots, especially in data dense 
regions. The Supporting Information also shows that the posterior is robust to changes in 
the prior on the shrinkage factors. 

5.4. MCMC efficiency in the updating of the knot locations. In order to study the 
efficiency of our algorithm for sampling the knot locations, we compare three types of MCMC 
updates of the knots: i) one-knot-at-a-time updates using a random walk Metropolis proposal 
with tuned variance (SRWM), ii) one-knot-at-a-time updates with the tailored Metropolis- 
Hastings step (SMH) in Section [3l2l and iii) full block updating of all knots using the tailored 
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Figure 6. Heat map to visualize the posterior density of the knot locations 
in covariate space for model with 4 free additive knots and 20 free surface 
knots for the firm leverage dataset. The plot is constructed by partitioning the 
covariate space into 70 x 70 rectangular regions and counting the number of 
surface knots in each rectangle over the MCMC draws. The posterior density 
of the locations of the additive knots is constructed in a similar fashion and 
separate heat maps for the additive knots in each covariate are shown just 
above the horizontal axis and vertical axis, respectively. 



Metropolis-Hastings step (BMH) i n Section 13.21 S RWM mov es are used in state - of-the 
art RJMCMC approaches such as 



Dimatteo et al 



(120011 ) and 



Gulam Razul et al. 



fl2003[ ). 



Note that we are not studying the performance of a complete RJMCMC scheme; we are 
here interested in isolating this particular upd ating step and comparing it to our tailored 
proposal. We use the inefficiency factor (IF) (Geweke 19921 ) to measure the efficiency of 
MCMC. The IF is a measure of the number of draws needed to obtain the equivalent of a 
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Figure 7. The posterior mean (left) and standard deviation (right) of the 
posterior surface for the model with 4 free additive knots and 20 free surface 
knots for firm leverage data. The subplot to the right also shows an overlay of 
the covariate observations. 



single independent draw. It is defined as IF = 1 + 2 Yli^i Pi where pi is the autocorrelation 
of the MCMC trajectory at lag i. We also document the effective sample size per minute, 
i.e. (number of MCMC draws) / (IF x computing time) to measure the overall efficiency of 
the MCMC. 

Table |2] shows the efficiency of the three knot sampling algorithms in a model with 20 
free surface knots and 4 additive knots in each covariate on the firm leverage data. The 
inefficiency factor in Table [2] is the average inefficiency of the posterior mean surface in 1000 
random chosen points in covariate space. There is some gain from tailoring the proposal for 
each knot separately, but the really striking observation from Table |2] is the massive efficiency 
and speed gains from updating all the blocks jointly using a tailored proposal; the effective 
sample size per minute is roughly 70 times larger when our BMH algorithm is used instead 
of simple SRWM updates. 
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Table 2. Comparison of algorithms for updating the knot locations in a model 
with 20 free surface knots and 4 additive knots in each covariate. Firm leverage 
data. 





SRWM 


SMH 


BMH 


Mean IF for the posterior mean surface 


29.63 


2.70 


1.16 


Mean acceptance probability 


0.26 


0.62 


0.88 


Computing time (mm) 


388.21 


1716.07 


141.72 


Effective sample size per minute 


0.87 


2.16 


60.83 



6. Concluding remarks 

We have presented a general Bayesian approach for fitting a flexible surface model for 
a continuous multivariate response using a radial basis spline with freely estimated knot 
locations. Our approach uses shrinkage priors to avoid overfitting. The locations of the 
knots and the shrinkage parameters are treated as unknown parameters and we propose a 
highly efficient MCMC algorithm for these parameters with the coefficients of the multivariate 
spline integrated out analytically. An important feature of our algorithm is that all knot 
locations are sampled jointly using a Metropolis-Hastings proposal density tailored to the 
conditional posterior, rather than the one-knot-at-a-time random walk proposals used in 
previous literature. The same applies to the block of shrinkage parameters. Both a simulation 
study and a real application on firm leverage data show that models with free knots have a 
better out-of-sample predictive performance than models with fixed knots. Moreover, the free 
knots model is also more robust in the sense that it performs consistently well across different 
datasets. We also found that models that mix surface and additive spline basis functions in 
the same model perform better than models with only one of the two basis types. 

Our approach can be directly used with other splines basis functions, other priors, and it is 
at least in principle straightforward to augment the model with Bayesian variable selection. 
Also, the assumption of Gaussian error distribution could be easily removed by using a 
Dirichlet process mixture (DPM) prior. We would still be able to inte grate out the regres sion 
coefficients if we assume a Gaussian base measure in the DPM, see 
details in the univariate case. 
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Appendix A. Details of the MCMC algorithm 



In this section we briefly address the MCMC details and related coniputati onal issues. For 
details on matrix manipulations and derivatives, see e.g. iLiitkepohll (j 19961 ). Our MCMC 
algorithm in Section [372] only requires the gradient of the conditional posteriors w.r.t. each 
parameter. Since users can always use their own prior on the knots and shrinkages, we will 
not document the gradient of any particular prior. In particular for the normal prior, one 
can directly find the results in e.g. iMardia et al.l (Il979l ). We now present the full gradients 
for the knot locations and the shrinkage parameters. 



A.l. Gradient w.r.t. the knot locations. 

ainp(^|A,S,r,X) d\ogp{^) p 
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where E = Y — XB, Ip is the identity matrix, Kp^p is the commutation matrix and 

9vec[S-i®X'X] ^.^^ ^M^vecX 
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[vec (X'rS-i) +S^V]'S^ 
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We can decompose the gradient for the design matrix w.r.t the knots as 

dvecXs/dveci^'^)' 0(„g^x«,) 

0{nqaXls) dvecXa/d^a 

where Ig and 1^ are numbers of parameters in the knots locations for surface and additive 
component, respectively. This decomposition makes user-specified basis functions for different 
components possible and one may update the locations in a parallel mode (efficient for small 
models) or batched mode (for models with many parameters). In particular for the thin-plate 
spline, we have 



dvecXi 



1 + 21n||x, 



[l + 2\n\\x,-^ij\\){x,-^,_ 



i£{a,s}, 

ie{i,...,90. 



Note that the gradient can be obtained efficiently by applying Lemma lA.ll and Algorithm 
lA.ll in Section [A. 31 below whenever dvec'S'^^ / d$,' and the commutation matrix appear. 



A. 2. Gradient w.r.t. the shrinkage parameters. 
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and 9vecS^^/9A' can be obtained efficiently by applying Lemma [A. II in Section [A.SI and by 



c?vecf(A""^^^S"^A""^^^) (g) P-] 
^ ^ = {Ip ® Kg^^p ® JqJ {Ip2 ® vecPi) {Ip2 + Kp^p) 

xf/,^[A-V^S-])^^^^,.e{a,.}. 



where dvecAi/dX^ is x p matrix with elements Vj(p+i)_p, j = —l/2X^j^'^ for j = 1, 
and zero elsewhere. 

A. 3. Computational remarks. The computational implementation of gradients in Section 
lA.ll and Section IA.2I is straightforward but the sparsity of some of the matrices can be 
exploited in moderate to large datasets. We now present a lemma and an algorithm that 
can dramatically speed up the computations. It is convenient to define A{i,:) and A{:,j) 
as matrix operations that reorders the rows and columns of matrix A with indices * and j. 
Therefore, /3 = 6(c, :), /x = fi*{c,:) and = Sf,(c, c) for proper indices c, and IS^I = 
|S^| since permuting two rows or columns changes the sign but not the magnitude of the 
determinant. 

Lemma A.l. Given matrix C and the indexing vector z such that (vecl]f,)(z, :) = vecS/3 
holds, we can decompose the following gradient as 



avec[S^^ 



^ gvec[(A7^/^S-iA7^/^)8)P,] 9vec[(A-'/'s-iA-'/') 



06/ ' " dOa' 



where 6 is any parameter vector of the covariance matrix Cg = {[C{:, z)]{:, hs)}{:, Zs 
0), hs = [{p^qqo + I), {P^QQo + "2), ...,p^q{qo + Qs)]' , = vec([Opg,xpgo, Ipg^xpg,, Opg^xpgJ'). 
Ca = {[C(:, z)]{:, h,)}{:, z, ^ 0), = [{p^qo + qs) + 1), (p'g(go + qs) + 2), ...,pV]' and 

Za = Vec([0pq^xp((jo+g3); lp<JaXp<ja] )■ 

Algorithm A.l. An efficient algorithm to calculate Km,nQ (or QKm,n) where Km,n is the 
commutation matrix and Q is any dense matrix that is conformable to Km,n- 

(1) Create an m x n (or n x m) matrix T and fill it by columns with the sequence 
{l,2,...,nm}. 

(2) Obtain the indexing vector t = vec(T'). 

(3) Return Q{t,:) (orQ{:,t)). 
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1. Prior Robustness 

This section explores the sensitivity of the posterior inferences with respect to variations in 
the prior. There are clearly many aspects of the prior to explore, but we will here focus on the 
sensitivity with respect to the prior on the shrinkage factors and the prior on the knot locations, 
which are the most influential priors for the model. Since our model is very flexible and richly 
parametrized it is natural to expect, or even desireable, that the posterior responds to variations 
in the prior hyperparameters. But since the prior in complex models is always hard to specify, it 
is hoped that moderate changes in the prior should at least not overturn the posterior inferences. 

1.1. The prior on the shrinkage parameters. Figure [ST] displays the posterior sensitivity of 
the knot locations, the posterior mean and standard deviation surfaces to changes in the prior 
variance on the shrinkage factors. The posterior and predictive results are clearly very robust to 
changes in this particular aspect of the prior. 



1.2. The prior on the knot locations. Figure [S2] displays the effect on the posterior knot 
density from changes in both the mean (columns) and the variance (rows) of prior on the knot 
locations. While there are some differences in the posterior knot densities when the prior variance 
changes (changes across rows) , there is much larger difference in the posterior of the knots when 
the prior mean of the knot locations change. This is partly explained by fact that the differences 
between the three ways of placing the prior means are rather dramatic, but it is clear that 
the prior mean of the knot locations are affecting where the knots are located a posteriori. 
Considering the complexity in the inference on the knot locations and the fact that many of the 
knots probably correspond to regression coefficients that are close to zero, this is perhaps not 
too surprising. 

The posterior inference of the knot locations is typically not of interest. What matters is the 
inferences on the conditional predictive distribution p(y|x). Figure [S3l and [S4l investigate the 
sensitivity of the posterior mean and standard deviation surfaces to changes in the prior mean and 
variance of the knot locations. Here the robustness to variations in the prior is much larger. Both 
the predictive mean and the predictive standard deviation remain largely unchanged, considering 
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Figure SI. Posterior sensitivity with respect to changes in the prior variance of 
the shrinkage factors. The first column shows the posterior of the knot density 
(top row), the posterior mean surface (middle row) and the posterior standard 
deviation surface (bottom row) using the default prior in the paper. The second 
and third columns demonstrates the effect on the posterior inferences when the 
prior variance is half of the default value (second column) and twice the default 
value (third column). 
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Figure S2. Sensitivity of the posterior knot density with respect to changes in 
the mean (columns) and variance (rows) in the prior distribution of the knot 
locations. The prior mean of the locations in the second columns is chosen from 
the empirical marginal distribution of each covariate, and the prior mean in the 
third column are random draws without replacement among the data points. 



the magnitude of the changes in the prior. The main differences in the prior mean surface occur 
in points of covariate space where the uncertainty in the predictive mean is large. 
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Figure S3. Sensitivity of the posterior mean surface with respect to changes 
in the mean (columns) and variance (rows) in the prior distribution of the knot 
locations. The prior mean of the locations in the second columns is chosen from 
the empirical marginal distribution of each covariate, and the prior mean in the 
third column are random draws without replacement among the data points. 




Figure S4. Sensitivity of the posterior standard deviation surface with respect 
to changes in the mean (columns) and variance (rows) in the prior distribution 
of the knot locations. The prior mean of the locations in the second columns is 
chosen from the empirical marginal distribution of each covariate, and the prior 
mean in the third column are random draws without replacement among the data 
points. 
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2. Further simulation results 



2.1. Additional results from the simulation study in Section 4 of the paper. This 
section documents the simulation results for the simulation setup with p = 2 and n = 1000. The 
simulation setup is identical to the one in Section 4 of the paper with the exception that number 
of data points is increased from n = 200 to n = 1000. Figure [S5] compares the estimation loss 
from using fixed and free knots, respectively. Figure IS6I compares the out-of-sample predictive 
residuals from a models with 15 fixed surface knots to a model with 5 free knots, and Figure [ST] 
does the same type of comparison for a model with 20 fixed surface knots to a model with 10 
free knots. 

2.2. Simulation results from a situation where none of the tv^o models are correct. In 

this section, we briefly describe a simple simulation example where the true model is not nested 
in any of the two estimated models. We generate Gaussian data around the following mean 
surface 



The function in Equation (jSip is called a radial function. We generate 100 datasets using 
iV(0,0.1) disturbances around the mean. The number of observations are 1000 in each dataset. 
We use linear and surface components. The number of knots used in the free knot models is 10, 
20, and 40, and the number of knots used in the flxed knots model is 10, 20, 40, 60, 80 and 100. 

Figure [S8] displays boxplots of the losses for the different number of knots in each model. The 
free knots model outperforms the flxed knots model, but the improvement from using free knots 
are not that large here since the covariate space is only two-dimensional, which is small enough 
for the flxed knots to have a decent coverage. 



(SI) 



y = 42.659[0.1 + {xi - 0.5)(0.05 + (xi - 0.5)^ 
- 10(xi - 0.5f{x2 - O.bf + 5(x2 



0.5)2)] 
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Figure S5. Boxplot of the log loss ratio comparing the performance of the fixed 
knots model with the free knots model for the DGP with p = 2 and n = 1000. 
The three columns of the figure correspond to different degrees of nonlinearity of 
the realized datasets, as measured by estimated DNL in (5) in the paper. 
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Figure S6. Plotting the norm of the predictive multivariate residuals as a func- 
tion of the distance between the covariate vector and its sample mean. The results 
are for the DGP with p = 2 and n = 1000. The lines in each subplot are the 
normed residuals from the model with 15 fixed surface knots (vertical bars above 
the zero line), and the model with 5 free knots (vertical bars below the zero line). 
The column to the left shows the results for three datasets chosen when perfor- 
mance is ranked according to the fixed knots model, and the right column displays 
the results for three datasets chosen when performance is ranked according to the 
free knots model. 
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Figure S7. Plotting the norm of the predictive multivariate residuals as a func- 
tion of the distance between the covariate vector and its sample mean. The 
results are for the DGP with p = 2 and n = 1000. The lines in each subplot are 
the normed residuals from the model with 20 fixed surface knots (vertical bars 
above the zero line), and the model with 10 free knots (vertical bars below the 
zero line). The column to the left shows the results for three datasets chosen 
when performance is ranked according to the fixed knots model, and the right 
column displays the results for three datasets chosen when performance is ranked 
according to the free knots model. 
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Figure S8. Boxplots of the loss in the simulations for the radial mean function. 



